The convolutional neural network (CNN) is a deep neural network used for many purposes, such as image
classification, detection, and face recognition, due to its high accuracy in classification and detection
tasks. In this paper, we develop a CNN based on the transfer learning approach for image classification.
The network uses blocks from two transfer learning architectures, ResNet and DenseNet, as its building
blocks, followed by a multilayer perceptron (MLP) classifier. The proposed method does not require the
datasets to be preprocessed before they are input into the network. It was trained on two datasets: the
Cifar-10 and the Sign-Traffic datasets. We conclude that the proposed method achieves the best performance
compared with other state-of-the-art methods. The accuracy obtained is 97.45% and 99.45%, respectively,
where the proposed CNN increased the accuracy compared to other methods by 3%.

Keywords: DenseNet, Multi-layer perceptron, ResNet


1. INTRODUCTION

Image classification is a significant problem in artificial vision systems and has been the subject of a
large volume of published studies for decades. This area aims to assign a label to a picture based on the
information visible in it [1]. Researchers may benefit from image classification since it allows them to
organize images according to their shared characteristics. For example, if images A and B share specific
characteristics, we may classify and label them as part of the same set. Previous studies have tested
classification algorithms in various ways and compared them [2]. Object detection and classification are
among the most challenging tasks in image processing. Several object classification methods have been
suggested throughout the years to address these issues [3].

Researchers and developers are now able to build bigger models to handle complicated problems,
something that was previously impossible with traditional artificial neural networks (ANNs) [4]. Most
earlier studies used hand-crafted features such as the histogram of oriented gradients (HoG) [5] or the
scale-invariant feature transform (SIFT) [6] to characterize a picture with discriminatory power. The
collected features are then fed into a learnable classifier, such as a support vector machine, a random
forest, or a decision tree.

However, discovering discriminative features from a large number of images becomes highly challenging.
For these and other reasons, a new model based on deep neural networks is needed. The convolutional
neural network (CNN) is one of the best-known deep neural networks and is widely used for image
identification. Many computer vision and natural language processing applications, such as image
identification and object identification, have benefited from its use [3]. Furthermore, CNN provides
outstanding efficiency in solving machine learning problems. For instance, a complete image classification
dataset is useful for programs that work with images. Over the previous decade, CNN has seen widespread
use in the effort to improve image classification accuracy. Since CNN permits the joint learning of
features and classifiers, it can provide superior classification accuracy for large datasets [7]. The
bag-of-features pipeline has also been used in image classification approaches: clustering is performed
using SIFT descriptors [8], and features are collected via spatial pooling [9], histogram encoding [10]
and, most recently, Fisher vector encoding [11]. While these representations have been shown to provide
workable outcomes, it is not immediately clear whether they are ideal for the task at hand. Designing them
also requires a lot of time and effort, not to mention the cost of hiring specialized personnel. The
AlexNet deep CNN by Krizhevsky et al. [12] stands out as the most novel of these networks (i.e., a large
network of 60 million parameters and 650,000 neurons trained on a graphics processing unit (GPU)).
AlexNet took up the challenge, outperforming its rivals and achieving a top-five error rate of just 15.3%.
The runner-up, which was not a CNN variant, had a top-five error rate close to 26.2%. Gehring et al. [13]
developed a CNN architecture for sequence-to-sequence learning. The model outperforms recurrent models,
which failed to capture the compositional nature of the sequences. In addition, all of the components may
be fully parallelized during training for more efficient computation.

In the model of Gehring et al. [13], to further ease optimization, the number of nonlinearities is fixed
and independent of the input length. Ye et al. [14] developed an alternative approach to CNN. They
detailed the pixel-by-pixel operation of the CNN and showed its use in several contexts. Since other kinds
of CNN have more features and processing capacity, many researchers employ them as primary image
classifiers in their studies. A comparison of the suggested approach with others indicated that it was
superior. To classify images more quickly and accurately than previous models, Han et al. [15] suggested a
novel CNN approach, which they tested on six distinct small-sized datasets to verify their findings. After
analyzing the outcomes, they concluded that this strategy is simple to apply to small datasets. A novel
model for multi-label image classification was presented by Song et al. [16]. The basic premise of this
study was to train a model using various data sources, including multi-label image data. This study is
helpful when many labels are needed. According to the accuracy figures reported in the paper, the model
outperforms the other models presented. In a recent study, Calik and Demirci [17] developed and validated
a method for classifying images on embedded systems. The researchers employed a CNN to achieve an accuracy
of 85.9% on the Cifar-10 dataset. Sharma et al. [18] published the results of an empirical study of the
output of common convolutional neural networks for object identification in real-time video feeds.

The paper is structured as follows. Section 2 summarizes the components used to construct the suggested
model and presents the proposed image classification framework. Section 3 describes the datasets used to
train the proposed model. Section 4 presents the experimental setup, the results and their discussion, the
comparison of the accuracy of the proposed model with that of previous models, and the evaluation metrics
used to assess the accuracy and error rate of the model. Section 5 concludes the paper.


2. METHOD 
2.1. DenseNets and ResNets blocks of suggested framework 

ResNet [19] and DenseNet [20] are two CNNs proposed in recent years that are regarded as state of the
art. Their success is primarily attributed to their respective building components, ResNet blocks (RBs)
and DenseNet blocks (DBs). Figure 1 shows an example of an RB composed of three convolutional layers and
one skip connection. The convolutional layers are named Conv1, Conv2, and Conv3. On Conv1, a reduced
number of filters of size 1x1 reduces the channel dimension of the input in order to lower the
computational cost of Conv2. On Conv2, filters of a larger size, such as 3x3, are used to learn spatial
features at the same resolution. On Conv3, a filter size of 1x1 is employed again, and this time the
channel dimension is raised so that more features may be generated. The output of Conv3 is added to the
input to form the output of the RB. If the dimensions of the input and of Conv3's output differ, a set of
convolutional operations with 1x1 filters is performed on the input to attain the same dimensionality as
the output of Conv3 before the sum. Figure 2 displays an example of a DB. In the interest of simplicity,
the DB has just four convolutional layers. In practice, the number of convolutional layers in the DB may
be adjusted by the user. Each convolutional layer in the DB takes as input the input data and the outputs
of all previous convolutional layers. The studies in [21], [22] investigated the mechanism underlying the
success of RBs and DBs, revealing that they can mitigate the vanishing gradient problem [23], so that a
deep architecture can efficiently learn the classification task from the training dataset and subsequently
improve classification accuracy. Additionally, it has been suggested that the dense connections in DBs
reuse low-level features to improve the discriminative power of the features learned in the upper layers
of CNNs [20]. The suggested method adopts RBs and DBs as its basis primarily because of these favorable
properties.


Figure 1. RB (I/P, Conv1, Conv2, Conv3, O/P)
Figure 2. DB (I/P, Conv1, Conv2, Conv3, Conv4, O/P)
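
The channel bookkeeping inside an RB and a DB can be made concrete with a small calculation. The
following Python sketch counts the convolution weights of a bottleneck RB and tracks how the input width
of each DB layer grows through concatenation. The function names, the growth rate of 32, and the decision
to ignore biases and normalization parameters are assumptions made for illustration; they are not taken
from the proposed model.

```python
def rb_weight_count(in_ch, mid_ch, out_ch):
    """Count convolution weights in a bottleneck RB (biases and normalization ignored)."""
    conv1 = 1 * 1 * in_ch * mid_ch    # 1x1 filters reduce the channel dimension
    conv2 = 3 * 3 * mid_ch * mid_ch   # 3x3 filters work on the reduced channels
    conv3 = 1 * 1 * mid_ch * out_ch   # 1x1 filters raise the channel dimension
    # A 1x1 projection on the skip path is only needed when the shapes differ.
    projection = in_ch * out_ch if in_ch != out_ch else 0
    return conv1 + conv2 + conv3 + projection


def db_layer_inputs(in_ch, growth_rate, n_layers):
    """Return the input width of each DB layer and the final output width."""
    widths = []
    channels = in_ch
    for _ in range(n_layers):
        widths.append(channels)   # the layer sees the block input plus all earlier outputs
        channels += growth_rate   # its own output is concatenated for the later layers
    return widths, channels


if __name__ == "__main__":
    print(rb_weight_count(64, 64, 256))    # RB whose skip path needs a projection
    print(rb_weight_count(256, 64, 256))   # RB with an identity skip connection
    print(db_layer_inputs(64, 32, 4))      # four-layer DB, as in the simplified example
```

Under these assumptions, the 1x1 layers keep the cost of the 3x3 layer independent of the block's input
and output widths, while in the DB every layer's input grows by the growth rate.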


2.2. Suggested framework for image classification 

We suggest a framework based on a deep neural network model with four blocks. The network consists of
the first two blocks from ResNet50 and the second two blocks from DenseNet121, followed by a fully
connected layer and a multi-layer perceptron (MLP) classifier that uses softmax as the activation
function. By default, these blocks share the same configuration parameters. This CNN proved effective
during training on both the Cifar-10 and the Sign-Traffic datasets. In every test, we split our dataset
into two parts: training and validation. In CNN training, the model with the lowest validation loss within
the first 30 iterations is selected as the best model. The architecture accepts images of varying sizes as
input, and the input images are zero-padded. Figure 3 illustrates the suggested network, and Table 1 lists
the settings of the blocks used to design the architecture.

Figure 3. The proposed architecture of the model


Table 1. The settings of the building blocks of the model

Model: R1-D1-R2-D2
Classifier: MLP

Building block   Configuration
R1               Conv2D [(1 x 1, 64), (3 x 3, 64), (1 x 1, 256)] x 3
R2               Conv2D [(1 x 1, 128), (3 x 3, 128), (1 x 1, 512)] x 3
D1               Conv2D [(1 x 1), (3 x 3)] x 6
D2               Conv2D [(1 x 1), (3 x 3)] x 12
Pooling          average pooling

Parameter             Value
Normalization         batch normalization, local response normalization
Activation function   leaky ReLU
Kernel size           7 x 7
Stride                (2, 2)
Filter                64
Pooling               average pooling
Kernel size           2 x 2
Stride                2 x 2
Filter1               56
Filter2               28
Padding               zero padding
Regularization        dropout = 0.5
Optimizer             Adam
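
The model selection rule described in this section, keeping the model with the lowest validation loss
among the first 30 iterations, can be sketched in plain Python. The function name and the list of
per-iteration validation losses are hypothetical; this is a minimal illustration of the rule, not the
training code used for the experiments.

```python
def select_best_epoch(val_losses, max_epochs=30):
    """Return (index, loss) of the lowest validation loss within the first max_epochs."""
    if not val_losses:
        raise ValueError("val_losses must contain at least one value")
    window = val_losses[:max_epochs]   # only the first max_epochs iterations are considered
    best_epoch = min(range(len(window)), key=window.__getitem__)  # first minimum wins ties
    return best_epoch, window[best_epoch]


if __name__ == "__main__":
    losses = [0.9, 0.5, 0.7, 0.4, 0.6]   # made-up validation losses, one per iteration
    print(select_best_epoch(losses))
```

In a Keras training loop the same effect is usually obtained with a checkpoint callback that monitors the
validation loss and saves only the best weights.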


3. BENCHMARK DATA SETS 
3.1. Cifar-10 

Cifar-10 is a dataset of 60,000 natural RGB images of 32x32 pixels [24]. It contains 50,000 training
images and 10,000 test images in ten classes: airplane, automobile, bird, cat, deer, dog, frog, horse,
ship, and truck, as shown in Figure 4. The images have different backgrounds and different light sources.
Objects are not restricted to the center of the image, and they vary widely in size.

Figure 4. Some images of Cifar-10 dataset [24]


3.2. Sign-Traffic
The traffic sign dataset [25] contains more than 360 images in total, divided into different classes.
To avoid using the testing data for model selection, we hold out 180 images from the training set for
validation and use 180 test images, spread among four classes: "stop sign", "non stop sign", "green light",
and "red light". Both training and testing data are distributed over these categories, as shown in
Figure 5.
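
Holding out part of the training set for validation, as described above, can be sketched as follows. The
function name, the fixed seed, and the pool of 500 placeholder items are assumptions for illustration; the
size of the actual training set is not stated here.

```python
import random


def hold_out(samples, n_val, seed=0):
    """Shuffle a copy of samples and split off n_val items for validation."""
    if not 0 <= n_val <= len(samples):
        raise ValueError("n_val must be between 0 and len(samples)")
    shuffled = list(samples)                 # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)    # seeded so the split is reproducible
    return shuffled[n_val:], shuffled[:n_val]   # (training part, validation part)


if __name__ == "__main__":
    pool = list(range(500))                  # hypothetical training pool of image indices
    train, val = hold_out(pool, 180)
    print(len(train), len(val))
```

Because the validation images are removed from the training part, the test images never influence model
selection.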


Figure 5. Some images from the sign-traffic dataset [25] (panels: NonStopImages, Greenlight, stopSignImages)


4. EXPERIMENT RESULTS AND DISCUSSION 

This experiment addresses multi-class image classification. In this part, we compare the proposed method
with other image classification methods described in the literature, demonstrating that the suggested
network performs well enough for our current needs. Cifar-10, which is widely used in detection and
classification applications, and the sign-traffic dataset are used to evaluate the efficacy of the
proposed architecture. The model is built using the Keras library and the Google TensorFlow framework on a
machine with 16 GB of RAM and an NVIDIA GeForce GTX 1650. The Adam optimizer was used with a learning rate
of 0.001 and a batch size of 50 samples. Moreover, the model uses the categorical cross-entropy loss
function. A dropout rate of 0.5 is used to avoid overfitting.
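
To make the loss concrete, the following Python sketch computes a softmax output and the categorical
cross-entropy for a single sample, and gathers the hyperparameters quoted above in one dictionary. The
function names, the dictionary layout, and the example scores are illustrative assumptions; the
experiments themselves rely on the Keras implementations.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(logits)                          # subtracting the maximum avoids overflow
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot, probs, eps=1e-12):
    """Cross-entropy between a one-hot label and predicted class probabilities."""
    # eps guards against log(0) when a probability underflows to zero.
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))


# Hyperparameters quoted in the text, collected in one place.
TRAINING_CONFIG = {
    "optimizer": "Adam",
    "learning_rate": 0.001,
    "batch_size": 50,
    "dropout": 0.5,
    "loss": "categorical cross-entropy",
}

if __name__ == "__main__":
    probs = softmax([2.0, 1.0, 0.1, 0.0])        # made-up scores for four classes
    print(categorical_cross_entropy([1, 0, 0, 0], probs))
```

The loss is zero only when the full probability mass is on the correct class, and it grows as the
predicted probability of the correct class shrinks.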

We used an MLP as the classifier, which is a natural choice since it combines the features non-linearly
to capture the possible patterns. The accuracy curves for the Cifar-10 and sign-traffic datasets are shown
in Figure 6(a) and Figure 6(b), while the corresponding loss curves are shown in Figure 7(a) and
Figure 7(b). Table 2 shows that the proposed network achieves a classification accuracy of 97.45% on
Cifar-10 and 99.45% on the sign-traffic dataset. The results are reasonably stable and accurate, giving
valuable insight into the classification performance on the images.


Table 2. The result of the proposed model on two datasets

Model            Datasets       Accuracy (%)   Error rate
Proposed model   Cifar-10       97.45          0.14
Proposed model   Sign-Traffic   99.45          1.72x107*

Figure 6. The result of the two datasets (a) the model accuracy of Cifar-10 dataset and (b) the model
accuracy of sign-traffic dataset

Figure 7. The result of the two datasets (a) the loss of Cifar-10 datasets and (b) the loss of sign-traffic
datasets


4.1. Comparisons with state-of-the-art works 

To further check the performance of the proposed CNN model, we compare it with some state-of-the-art
works. The handcrafted AE-CNN by Sun, the CNN proposed by Yim et al. [1], and the network proposed by
Aamir et al. [3] are designed for Cifar-10 dataset classification tasks, so they are compared only on
Cifar-10. In addition, we compare the results on the sign-traffic dataset with the CNN proposed by Jmour
and Zayen. Table 3 reports the multi-class classification accuracy on these datasets. We can see that the
proposed CNN obtains the best results in the multi-class classification tasks.

Table 3. Comparison of proposed model with other models

Model            Datasets       Accuracy (%)   Error rate
AE-CNN           Cifar-10       95.03          0.04
CNN              Cifar-10       78.29          0.22
CNN              Cifar-10       95.53          0.04
Proposed model   Cifar-10       97.45          0.03
AlexNet          Sign-Traffic   93.33          0.07
Proposed model   Sign-Traffic   99.45          0.005
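
The gap between the proposed model and the strongest competing entry can be read off Table 3 with a short
calculation. In the Python sketch below, the accuracy values are copied from Table 3; the dictionary
layout, the row labels for the two CNN entries, and the function name are assumptions made for
illustration.

```python
# Accuracy values (%) copied from Table 3; the two "CNN" rows are labeled by position.
TABLE_3 = {
    "Cifar-10": {
        "AE-CNN": 95.03,
        "CNN (first row)": 78.29,
        "CNN (second row)": 95.53,
        "Proposed model": 97.45,
    },
    "Sign-Traffic": {
        "AlexNet": 93.33,
        "Proposed model": 99.45,
    },
}


def gain_over_best_baseline(results, proposed="Proposed model"):
    """Percentage-point gap between the proposed model and the best other entry."""
    best_other = max(acc for name, acc in results.items() if name != proposed)
    return round(results[proposed] - best_other, 2)


if __name__ == "__main__":
    for dataset, results in TABLE_3.items():
        print(dataset, gain_over_best_baseline(results))
```

With the values in Table 3, the gap is 1.92 percentage points on Cifar-10 and 6.12 percentage points on
Sign-Traffic.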


4.2. Evaluation metrics 

We use the accuracy (ACC) and error rate (ERR) metrics to evaluate the model's performance. Accuracy
indicates how often the model's predictions are correct. Accuracy and the error rate are inversely
related: high accuracy implies a low error rate, and a high error rate implies low accuracy. ACC is
derived by dividing the total number of correct predictions by the total number of observations in the
dataset (shown in (1)). ERR is computed by dividing the total number of incorrect predictions by the total
number of observations in the dataset (shown in (2)).


$$\mathrm{ACC} = \frac{TP + TN}{P + N} \tag{1}$$

$$\mathrm{ERR} = \frac{FP + FN}{P + N} \tag{2}$$


where (TP + TN) is the number of correct predictions, (FP + FN) is the number of incorrect predictions,
and (P + N) is the total number of samples in the dataset. TP, TN, FP, and FN are taken from the confusion
matrix shown in Table 4.


Table 4. Confusion matrix for showing results of classification

                 Predictions
Actual           Real                  Fake
Real             True positive (TP)    False negative (FN)
Fake             False positive (FP)   True negative (TN)
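
The two metrics can be computed directly from the four cells of the confusion matrix. The Python sketch
below follows (1) and (2), assuming the usual convention that P + N equals TP + TN + FP + FN; the function
names and the example counts are hypothetical.

```python
def accuracy(tp, tn, fp, fn):
    """ACC = (TP + TN) / (P + N), with P + N = TP + TN + FP + FN."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("the confusion matrix is empty")
    return (tp + tn) / total


def error_rate(tp, tn, fp, fn):
    """ERR = (FP + FN) / (P + N), with P + N = TP + TN + FP + FN."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("the confusion matrix is empty")
    return (fp + fn) / total


if __name__ == "__main__":
    counts = dict(tp=45, tn=40, fp=5, fn=10)   # made-up counts for 100 samples
    print(accuracy(**counts), error_rate(**counts))
```

Under this convention the two values always sum to one, which is the inverse relation between accuracy
and error rate noted in this section.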


5. CONCLUSION AND FUTURE WORK 

In this study, we developed a technique that employs a deep neural network consisting of two blocks from
each of two transfer learning architectures, namely ResNet and DenseNet, followed by a fully connected
layer. The approach is trained and tested on the Cifar-10 and sign-traffic datasets. The CNN is used to
classify image data. We have shown that fine-tuning the settings is a crucial and beneficial training
strategy. Based on these results, it can be concluded that image classification using deep neural networks
can achieve high performance. The suggested network needed fewer processing resources and less memory. The
network improves classification accuracy and yields acceptable identification outcomes compared to
conventional methods. In addition, the network's performance assessment indicates that it may be used to
construct a considerably better classifier. In future work, researchers can use other transfer learning
networks instead of ResNet and DenseNet for image classification.